Tomorrow's Rain Prediction¶
In the following notebook we evaluate two classifier models (Random Forest and Gradient Boosting) for predicting whether it is going to rain the following day.
When executed, the data is fetched live from the ADLS account hosting the rain_prediction.parquet dataset. This dataset is updated daily by an automated data pipeline that ingests data from the OpenWeatherMaps API. The ingested data is then transformed until it lands, mostly clean, in the aforementioned dataset.
We start by importing the required libraries and creating a small helper class for pretty-printing column labels in plotly figures.
# Standard lib
import sys
from pathlib import Path
# External libs
import polars as pl
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
from plotly.subplots import make_subplots
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import ADASYN
from matplotlib import pyplot as plt
# Custom lib
sys.path.insert(0, "../openweather-functions/")
from src.destinations.adls import ADLS
# Requires a ./.env file with the following variables in order to authenticate to ADLS
# AZURE_CLIENT_SECRET=""
# AZURE_CLIENT_ID=""
# AZURE_TENANT_ID=""
# AZURE_ACCOUNT_NAME=""
# AZURE_CONTAINER_NAME=""
_ = load_dotenv()
# Required for exporting the plotly figures into HTML
pio.renderers.default = "notebook"
class Label(str):
    """Helper class for pretty-printing labels with plotly"""

    def __init__(self, label):
        super().__init__()
        self.name = label
        # Human-readable version of the column name, e.g. "wind_speed" -> "Wind speed"
        self.pretty = label.replace("_", " ").capitalize()
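As a quick self-contained sketch of the intended behaviour: a `Label` still compares and behaves like a plain `str`, while exposing a `pretty` attribute for plot titles.

```python
class Label(str):
    """Helper class for pretty-printing labels with plotly"""

    def __init__(self, label):
        super().__init__()
        self.name = label
        self.pretty = label.replace("_", " ").capitalize()

wind = Label("wind_speed")
print(wind.pretty)   # "Wind speed"
print(wind.upper())  # still a regular str: "WIND_SPEED"
```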
Dataset exploration and preparation¶
Download the dataset and create a polars dataframe.
file_name = "rain_prediction.parquet"
adls = ADLS(directory="ml")
try:
    adls.download_file(file_name, file_name)
    df = pl.read_parquet(file_name)
finally:
    # Remove the local copy once the dataframe is in memory
    Path(file_name).unlink(missing_ok=True)
df.head()
ADLS: Using default envrionment credentials
ADLS: Initializing DataLakeServiceClient
| recorded_day | location | clouds | temperature_feels_like | avg_temperature | max_temperature | min_temperature | humidity | pressure | degrees | wind_gusts | wind_speed | rain | max_rain | rain_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | str | f64 | f64 | f64 | f32 | f32 | f64 | f64 | f64 | f64 | f64 | f64 | f32 | str |
| 2024-12-31 | "valencia" | 75.0 | 9.059998 | 9.929993 | 10.899994 | 8.899994 | 93.0 | 1031.0 | 330.0 | null | 2.06 | 2.73 | 2.73 | "moderate" |
| 2025-01-01 | "valencia" | 70.0 | 10.171673 | 10.892921 | 14.480011 | 7.240021 | 88.083333 | 1029.583333 | 258.416667 | 2.817 | 2.203333 | 5.05 | 2.73 | "dry" |
| 2025-01-02 | "valencia" | 6.75 | 10.032922 | 11.140424 | 17.820007 | 5.470001 | 73.916667 | 1026.083333 | 257.75 | 3.716875 | 3.2575 | null | null | "light" |
| 2025-01-03 | "valencia" | 10.458333 | 11.807924 | 13.062091 | 19.450012 | 6.240021 | 54.291667 | 1021.916667 | 273.75 | 4.917143 | 3.880833 | 1.81 | 0.87 | "dry" |
| 2025-01-04 | "valencia" | 11.083333 | 12.952504 | 14.05084 | 20.040009 | 10.240021 | 55.208333 | 1019.958333 | 261.25 | 4.9725 | 3.672083 | null | null | "dry" |
The dataset consists of 15 columns. We are going to use 10 as predictors and 1 (rain_prediction) to derive the column to be predicted. We ignore the recorded_day and location columns, since at prediction time the model only has the current day's measurements to predict the next day.
df = df.fill_null(0.0)
min_date = df["recorded_day"].min()
max_date = df["recorded_day"].max()
unique_locations = len(
    df.select(pl.col("location").str.replace_all("_", " ").str.to_uppercase()).unique()
)
print(
    f"Dataset used includes data from {min_date.isoformat()} to {max_date.isoformat()} "
    f"from {unique_locations} distinct locations for a total of {len(df)} entries."
)
Dataset used includes data from 2024-12-31 to 2025-12-12 from 17 distinct locations for a total of 6002 entries.
Plot the distributions of predictive variables plus the rain variables for visual inspection. Note that the rain variables are not going to be used as predictors, and that the predicted variable is going to be created from the rain_prediction column instead.
x_labels = [
    Label(x)
    for x in [
        "clouds",
        "temperature_feels_like",
        "avg_temperature",
        "max_temperature",
        "min_temperature",
        "humidity",
        "pressure",
        "degrees",
        "wind_gusts",
        "wind_speed",
    ]
]
plot_labels = x_labels + [Label("rain"), Label("max_rain")]
ROWS = 3
COLS = 4
fig = make_subplots(
    rows=ROWS, cols=COLS, subplot_titles=[label.pretty for label in plot_labels]
)
for idx, label in enumerate(plot_labels):
    fig.add_trace(
        go.Histogram(x=df[label], nbinsx=100),
        row=idx // COLS + 1,
        col=idx % COLS + 1,
    )
fig.update_layout(showlegend=False, height=500, width=1000)
fig.show()
Standardize predictor variables to generate the independent variable set.
X = df.select(
    (pl.col(label) - pl.col(label).mean()) / pl.col(label).std() for label in x_labels
)
X.head()
| clouds | temperature_feels_like | avg_temperature | max_temperature | min_temperature | humidity | pressure | degrees | wind_gusts | wind_speed |
|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | f32 | f32 | f64 | f64 | f64 | f64 | f64 |
| 1.246957 | -0.996281 | -1.028313 | -1.383602 | -0.509039 | 1.477731 | 2.423094 | 1.999844 | -1.160838 | -0.672577 |
| 1.080721 | -0.844899 | -0.886313 | -0.939287 | -0.773078 | 1.171419 | 2.174604 | 1.038449 | -0.407171 | -0.5937 |
| -1.022164 | -0.863793 | -0.849815 | -0.524761 | -1.054621 | 0.288827 | 1.560687 | 1.029495 | -0.166416 | -0.013588 |
| -0.898872 | -0.622081 | -0.566433 | -0.322462 | -0.93214 | -0.933824 | 0.829834 | 1.244382 | 0.154707 | 0.329435 |
| -0.878093 | -0.466217 | -0.420625 | -0.249237 | -0.295892 | -0.876715 | 0.486332 | 1.076501 | 0.169517 | 0.214559 |
Generate the variable to predict. In this case the rain_prediction column, whose values are dry, light, moderate, or intense, is aggregated into either dry or rain. This variable indicates whether it rained the following day, and it is what we are going to train our models to predict. Furthermore, create another column with 0/1 values encoding the same information.
df = df.with_columns(
    pl.when(pl.col("rain_prediction").is_in(("light", "moderate", "intense")))
    .then(pl.lit("rain"))
    .otherwise(pl.lit("dry"))
    .alias("simple_rain_prediction"),
).with_columns(
    pl.when(pl.col("simple_rain_prediction") == pl.lit("rain"))
    .then(1)
    .otherwise(0)
    .alias("b_simple_rain_prediction")
)
y = df.select("b_simple_rain_prediction")
df.select("simple_rain_prediction", "b_simple_rain_prediction").head()
| simple_rain_prediction | b_simple_rain_prediction |
|---|---|
| str | i32 |
| "rain" | 1 |
| "dry" | 0 |
| "rain" | 1 |
| "dry" | 0 |
| "dry" | 0 |
Evaluate the class balance of the dataset. The data is currently obtained from different locations in Spain, so it is expected to be imbalanced towards dry days.
label_count = (
    df.select(pl.col("simple_rain_prediction"))
    .group_by(pl.col("simple_rain_prediction"))
    .agg(pl.col("simple_rain_prediction").count().alias("count"))
)
fig = px.bar(
    label_count,
    x="simple_rain_prediction",
    y="count",
    title="Label Distribution in Dataset Before Resampling",
    labels={"simple_rain_prediction": "Simple Rain Prediction", "count": "Count"},
    width=500,
    height=500,
)
fig.show()
Separate the dataset into train and test datasets (80/20).
Furthermore, use ADASYN to synthetically balance the training dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
X_train_resampled, y_train_resampled = ADASYN().fit_resample(
    X_train.to_numpy(), y_train.to_numpy()
)
print(f"Total: {len(X)}")
print(f"Training: {len(y_train)}")
print(f"Resampled: {len(y_train_resampled)}")
print(f"Test: {len(y_test)}")
Total: 6002
Training: 4801
Resampled: 7338
Test: 1201
Plot the count of dry and rain cases again, this time for the resampled training dataset.
label_count_resampled = pl.DataFrame({
    "simple_rain_prediction": ["dry", "rain"],
    "count": [y_train_resampled.size - y_train_resampled.sum(), y_train_resampled.sum()],
})
fig = px.bar(
    label_count_resampled,
    x="simple_rain_prediction",
    y="count",
    title="Label Distribution in Dataset After Resampling",
    labels={"simple_rain_prediction": "Simple Rain Prediction", "count": "Count"},
    width=500,
    height=500,
)
fig.show()
Model training and evaluation¶
We create an evaluator factory that closes over the train and test datasets. This makes it much easier to evaluate different models against the same data.
def get_model_evaluator(X_train, X_test, y_train, y_test):
    def evaluate_model(model, name):
        model = model.fit(X_train, y_train.to_numpy().ravel())
        confusion_mat = confusion_matrix(y_test, model.predict(X_test))
        tn, fp = confusion_mat[0]
        fn, tp = confusion_mat[1]
        total = confusion_mat.sum()
        p = tp + fn
        n = tn + fp
        tpr = sensitivity = tp / p
        tnr = specificity = tn / n
        precision = tp / (tp + fp)
        accuracy = (tp + tn) / total
        balanced_accuracy = (tpr + tnr) / 2
        metrics = {
            "accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
        }
        print(f"EVALUATION METRICS FOR '{name}'")
        print("===============================")
        print(f"| Accuracy | {accuracy:.3%} |")
        print("===============================")
        print(f"| Balanced Accuracy | {balanced_accuracy:.3%} |")
        print("===============================")
        print(f"| Precision (PPV) | {precision:.3%} |")
        print("===============================")
        print(f"| Sensitivity (TPR) | {sensitivity:.3%} |")
        print("===============================")
        print(f"| Specificity (TNR) | {specificity:.3%} |")
        print("===============================")
        fig, ax = plt.subplots(1, 1)
        disp = ConfusionMatrixDisplay(confusion_mat, display_labels=["dry", "rain"])
        disp.plot(ax=ax)
        _ = ax.set_title(name)
        return fig, ax, metrics

    return evaluate_model
model_evaluator = get_model_evaluator(X_train, X_test, y_train, y_test)
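The metric arithmetic inside the evaluator can be checked by hand on a hypothetical confusion matrix (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = actual (dry, rain), cols = predicted
cm = np.array([[900, 50],    # tn, fp
               [150, 100]])  # fn, tp
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()      # (100 + 900) / 1200
sensitivity = tp / (tp + fn)         # 100 / 250 = 0.4
specificity = tn / (tn + fp)         # 900 / 950
balanced_accuracy = (sensitivity + specificity) / 2
precision = tp / (tp + fp)           # 100 / 150

print(f"{accuracy:.3f} {balanced_accuracy:.3f} {precision:.3f}")  # 0.833 0.674 0.667
```

Balanced accuracy averages the per-class recalls, which is why it is the more honest metric for this imbalanced dataset: a model that always predicts "dry" would score high plain accuracy but only 50% balanced accuracy.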
Use the evaluator function to train and evaluate the two selected models.
First, the Random Forest.
_ = model_evaluator(RandomForestClassifier(n_estimators=100), "Random Forest")
EVALUATION METRICS FOR 'Random Forest'
===============================
| Accuracy | 78.102% |
===============================
| Balanced Accuracy | 57.817% |
===============================
| Precision (PPV) | 50.893% |
===============================
| Sensitivity (TPR) | 21.509% |
===============================
| Specificity (TNR) | 94.124% |
===============================
Second, the Gradient Boosting.
_ = model_evaluator(GradientBoostingClassifier(n_estimators=100), "Gradient Boosting")
EVALUATION METRICS FOR 'Gradient Boosting'
===============================
| Accuracy | 78.601% |
===============================
| Balanced Accuracy | 57.731% |
===============================
| Precision (PPV) | 54.000% |
===============================
| Sensitivity (TPR) | 20.377% |
===============================
| Specificity (TNR) | 95.085% |
===============================
Results evaluation¶
TODO